Linux mem 1.3: The Paging Mechanism in Detail


2023-11-26 10:04 | Source: compiled from the web

Contents

- 1. Definitions in the x86 manual
  - 1.1 Paging modes
  - 1.2 `4-LEVEL PAGING` and `5-LEVEL PAGING` modes
    - 1.2.1 `4-LEVEL PAGING`
    - 1.2.2 CR3 format
    - 1.2.3 PML5/PGD entry format
    - 1.2.4 PML4/P4D entry format
    - 1.2.5 PDPT/PUD entry format (1-GByte Page)
    - 1.2.6 PDPT/PUD entry format (Page Directory)
    - 1.2.7 PD/PMD entry format (2-MByte Page)
    - 1.2.8 PD/PMD entry format (Page Directory)
    - 1.2.9 PTE format (4-KByte Page)
  - 1.3 Access Right
    - 1.3.1 Access modes
    - 1.3.2 Access rights
    - 1.3.2 Protection Keys
  - 1.4 PAGE-FAULT EXCEPTIONS
  - 1.5 ACCESSED AND DIRTY FLAGS
  - 1.6 MEMORY TYPING
    - 1.6.1 PAT is Not Supported
    - 1.6.2 PAT is Supported
    - 1.6.3 Caching
  - 1.7 CACHING TRANSLATION INFORMATION
    - 1.7.1 Process-Context Identifiers (PCIDs)
    - 1.7.2 Translation Lookaside Buffers (TLBs)
    - 1.7.3 Paging-Structure Caches
- 2. Code walkthrough
  - 2.1 Creating page tables
  - 2.2 Looking up page tables
  - 2.3 Setting page attributes (R/W/X)
    - 2.3.1 Converting protection flags
    - 2.3.2 mprotect()
  - 2.4 writenotify
  - 2.5 Switching mm
- References

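1. Definitions in the x86 manual

The x86 architecture has two address-translation mechanisms, described in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3, Chapter 4:

- Segmentation translates logical addresses into linear addresses.
- Paging translates linear addresses into physical addresses and, while doing so, checks the access rights for the address and determines its cache type (memory type).

Linux relies on paging only; its segments are set up as a flat model, so segmentation plays no real part in address translation.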

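1.1 Paging modes

x86 supports four paging modes:

- 32-bit paging. 32-bit mode; 32-bit linear addresses; physical addresses up to 40 bits; page sizes of 4 KB and 4 MB.
- PAE paging. 32-bit mode; 32-bit linear addresses; physical addresses up to 52 bits; page sizes of 4 KB and 2 MB; supports the execute-disable attribute.
- 4-level paging. 64-bit mode; 48-bit linear addresses (mapped through four levels of page tables: pgd→pud→pmd→pte); physical addresses up to 52 bits; page sizes of 4 KB, 2 MB and 1 GB; supports the execute-disable attribute, PCIDs and protection keys.
- 5-level paging. 64-bit mode; 57-bit linear addresses (mapped through five levels of page tables: pgd→p4d→pud→pmd→pte); physical addresses up to 52 bits; page sizes of 4 KB, 2 MB and 1 GB; supports the execute-disable attribute, PCIDs and protection keys.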

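Key attributes:

- Execute-disable access rights: prevent software from fetching instructions from pages that are otherwise readable.
- PCIDs (process-context identifiers): with 4-level and 5-level paging, software can enable a feature that lets a logical processor cache information for multiple linear-address spaces. The processor can then retain the cached information when software switches between linear-address spaces.
- Protection keys: with 4-level and 5-level paging, every linear address is associated with a protection key. Software can use the protection-key rights register (PKRU) to disable certain access rights for all linear addresses associated with a given key.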

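Overview of the control registers related to paging.

Mode selection:

| Register | Bit | Name | Function | Description |
|---|---|---|---|---|
| CR0 | 31 | PG | enables paging | Turns paging on |
| CR3 | - | - | pgd physical address | Physical address of the top-level (pgd) page table |
| CR4 | 5 | PAE | paging modes | Selects 32-bit paging (0) or PAE paging (1); must be 1 for 4-level and 5-level paging |
| IA32_EFER | 8 | LME | paging modes | Selects 32-bit or 64-bit (IA-32e) mode |
| CR4 | 12 | LA57 | paging modes | Linear-address width in 64-bit mode: 48 bits (0) or 57 bits (1) |

Modifiers:

| Register | Bit | Name | Function | Description |
|---|---|---|---|---|
| CR0 | 16 | WP | write protect | Protects read-only addresses from supervisor-mode writes |
| CR4 | 4 | PSE | page size extensions | Larger page size for 32-bit paging |
| CR4 | 7 | PGE | page global enable | Global pages |
| CR4 | 17 | PCIDE | process-context identifiers | Process-context identifiers |
| CR4 | 20 | SMEP | supervisor-mode execution prevention | Prevents supervisor-mode instruction fetches from user-mode pages |
| CR4 | 21 | SMAP | supervisor-mode access prevention | Prevents supervisor-mode data accesses to user-mode pages |
| CR4 | 22 | PKE | protection keys enable | Enables protection keys for user-mode addresses |
| CR4 | 23 | CET | control-flow enforcement technology | Control-flow enforcement and shadow stacks |
| CR4 | 24 | PKS | protection keys for supervisor | Enables protection keys for supervisor-mode addresses |
| IA32_EFER | 11 | NXE | execute-disable | Enables the execute-disable access right |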

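Meaning of the attribute-control bits (section numbers refer to the Intel manual):

- CR0.WP protects pages from supervisor-mode writes. If CR0.WP = 0, supervisor-mode write accesses are allowed to linear addresses with read-only access rights; if CR0.WP = 1, they are not. (User-mode write accesses to read-only linear addresses are never allowed, whatever the value of CR0.WP.) Section 4.6 explains how access rights are determined, including the definitions of supervisor-mode and user-mode accesses.
- CR4.PSE enables 4 MB pages for 32-bit paging. If CR4.PSE = 0, 32-bit paging can use only 4 KB pages; if CR4.PSE = 1, it can use both 4 KB and 4 MB pages. See Section 4.3. (PAE paging, 4-level paging and 5-level paging can use multiple page sizes regardless of CR4.PSE.)
- CR4.PGE enables global pages. If CR4.PGE = 0, no translations are shared across address spaces; if CR4.PGE = 1, specified translations may be shared across address spaces. See Section 4.10.2.4.
- CR4.PCIDE enables process-context identifiers (PCIDs) for 4-level and 5-level paging. PCIDs allow a logical processor to cache information for multiple linear-address spaces. See Section 4.10.1.
- CR4.SMEP protects pages from supervisor-mode instruction fetches. If CR4.SMEP = 1, software running in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode. Section 4.6 explains how access rights are determined, including the definitions of supervisor-mode accesses and user-mode accessibility.
- CR4.SMAP protects pages from supervisor-mode data accesses. If CR4.SMAP = 1, software running in supervisor mode cannot access data at linear addresses that are accessible in user mode. Software can override this protection by setting EFLAGS.AC. See Section 4.6.
- CR4.PKE and CR4.PKS allow access rights to be specified through protection keys. 4-level and 5-level paging associate each linear address with a protection key. When CR4.PKE = 1, the PKRU register specifies, for each protection key, whether user-mode linear addresses with that key can be read or written. When CR4.PKS = 1, the IA32_PKRS MSR does the same for supervisor-mode linear addresses. See Section 4.6.
- CR4.CET enables control-flow enforcement technology, including the shadow-stack feature. If CR4.CET = 1, certain memory accesses are identified as shadow-stack accesses and certain linear addresses translate to shadow-stack pages. Section 4.6 explains how the access rights of these accesses and pages are determined. (The processor allows CR4.CET to be set only if CR0.WP is also set.)
- IA32_EFER.NXE enables execute-disable access rights for PAE paging, 4-level paging and 5-level paging. If IA32_EFER.NXE = 1, instruction fetches can be prevented from specified linear addresses (even if data reads from those addresses are allowed). See Section 4.6. (IA32_EFER.NXE has no effect on 32-bit paging. Software that wants to use this feature to restrict instruction fetches from readable pages must use PAE paging, 4-level paging or 5-level paging.)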
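As a rough illustration of how the mode-selection bits combine, the sketch below maps register values to the name of the active paging mode. The helper `paging_mode()` and the macro names are hypothetical, invented for this example; the bit positions (CR0.PG = 31, CR4.PAE = 5, CR4.LA57 = 12, IA32_EFER.LME = 8) are taken from the Intel manual. It ignores every consistency check the processor performs when these bits are loaded.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit positions per the Intel SDM; macro names are illustrative only. */
#define CR0_PG    (1ULL << 31)
#define CR4_PAE   (1ULL << 5)
#define CR4_LA57  (1ULL << 12)
#define EFER_LME  (1ULL << 8)

/* Hypothetical helper: name the paging mode selected by the given
 * CR0, CR4 and IA32_EFER values. */
static const char *paging_mode(uint64_t cr0, uint64_t cr4, uint64_t efer)
{
    if (!(cr0 & CR0_PG))
        return "none";          /* paging disabled */
    if (!(cr4 & CR4_PAE))
        return "32-bit";        /* 32-bit paging */
    if (!(efer & EFER_LME))
        return "PAE";           /* PAE paging */
    return (cr4 & CR4_LA57) ? "5-level" : "4-level";
}
```

For example, `paging_mode(CR0_PG, CR4_PAE, EFER_LME)` yields `"4-level"`, the configuration a typical 64-bit kernel without LA57 would run with.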

All four paging modes translate addresses through hierarchical paging structures.

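Important flags in the paging-structure entries:

- P flag (bit 0). Translation stops when it reaches a paging-structure entry that is marked "not present" (its P flag is 0) or that has a reserved bit set. In that case the linear address has no translation, and an access to it causes a page-fault exception (see Section 4.7).
- PS (page size) flag (bit 7). If more than 12 bits of the linear address remain to be translated, bit 7 of the current paging-structure entry is consulted. If the bit is 0, the entry references another paging structure; if it is 1, the entry maps a page, and that page is a huge page of 1 GB, 2 MB or 4 MB. If only 12 bits of the linear address remain, the current entry always maps a page (bit 7 is used for another purpose); this is an ordinary 4 KB page.

1.2 4-LEVEL PAGING and 5-LEVEL PAGING modes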

In 64-bit mode every paging-structure entry is 8 bytes, so one 4 KB page holds 512 (2^9) entries and each level of the hierarchy translates 9 bits of the linear address. The offset into the final page frame takes 12 bits (4 KB).
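The arithmetic above can be written down as a small sketch. The helpers `entry_span()` and `linear_address_bits()` are hypothetical names chosen for this example; the level numbering (0 for the last-level table up to 4 for the PML5) is likewise an assumption of the sketch, not kernel convention.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12  /* 4 KB page: 12 offset bits */
#define BITS_PER_LEVEL   9  /* 512 eight-byte entries per 4 KB table */

/* Hypothetical helper: bytes of linear-address space covered by ONE entry
 * at the given level, where 0 = page table (PTE), 1 = PD, 2 = PDPT,
 * 3 = PML4 and 4 = PML5. */
static uint64_t entry_span(unsigned level)
{
    assert(level <= 4);
    return 1ULL << (PAGE_SHIFT + BITS_PER_LEVEL * level);
}

/* Hypothetical helper: linear-address width of a hierarchy with the
 * given number of levels. */
static unsigned linear_address_bits(unsigned levels)
{
    return PAGE_SHIFT + BITS_PER_LEVEL * levels;
}
```

The results line up with the region sizes quoted in the entry-format tables of the manual: one PD entry spans 2 MB, one PDPT entry 1 GB, one PML4 entry 512 GB and one PML5 entry 256 TB, while four and five levels give 48-bit and 57-bit linear addresses.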

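This gives the two paging modes available in 64-bit mode:

1. 4-LEVEL PAGING. Linear addresses are 48 bits wide (9+9+9+9+12) and physical addresses up to 52 bits. The Linux paging structure is pgd→pud→pmd→pte→page(4k).
2. 5-LEVEL PAGING. Linear addresses are 57 bits wide (9+9+9+9+9+12) and physical addresses up to 52 bits. The Linux paging structure is pgd→p4d→pud→pmd→pte→page(4k).

1.2.1 4-LEVEL PAGING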

4-LEVEL PAGING supports 4 KB, 2 MB and 1 GB pages. The paging-structure hierarchy for each page size is:

1. 4-LEVEL PAGING, 4 KB page: pgd→pud→pmd→pte→page(4k)
2. 4-LEVEL PAGING, 2 MB page: pgd→pud→pmd→page(2M)
3. 4-LEVEL PAGING, 1 GB page: pgd→pud→page(1G)
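To make the three layouts concrete, here is a minimal sketch of how a 48-bit linear address splits into table indices and a page offset. The struct and function names are hypothetical; the shift amounts follow directly from the 9+9+9+9+12 split.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the pieces of a 4-level linear address. */
struct va_indices {
    unsigned pgd, pud, pmd, pte;  /* 9-bit table indices */
    unsigned offset;              /* 12-bit offset inside a 4 KB page */
};

/* Split a linear address for the 4 KB-page walk pgd→pud→pmd→pte. */
static struct va_indices split_4level(uint64_t va)
{
    struct va_indices ix;

    ix.pgd    = (va >> 39) & 0x1ff;
    ix.pud    = (va >> 30) & 0x1ff;
    ix.pmd    = (va >> 21) & 0x1ff;
    ix.pte    = (va >> 12) & 0x1ff;
    ix.offset = va & 0xfff;
    return ix;
}

/* With a 2 MB page the walk stops at the pmd, so the low 21 bits are the
 * page offset; with a 1 GB page it stops at the pud and 30 bits remain. */
static uint64_t offset_2m(uint64_t va) { return va & ((1ULL << 21) - 1); }
static uint64_t offset_1g(uint64_t va) { return va & ((1ULL << 30) - 1); }
```

The larger page sizes simply fold the indices of the skipped levels into the offset, which is why a huge-page mapping needs fewer memory references per walk.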

The following subsections describe the formats of CR3 and of the paging-structure entries used in these modes in detail.

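1.2.2 CR3 format

Both paging modes translate linear addresses through a hierarchy of in-memory paging structures that is located via CR3: the contents of CR3 locate the first paging structure. For 4-level paging this is the PML4 table; for 5-level paging it is the PML5 table. Linux calls it the PGD (page global directory) in both cases.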
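The two CR3 layouts described in this section (with and without PCIDs) can be decoded as in the sketch below. `struct cr3_fields` and `decode_cr3()` are hypothetical names for illustration; `maxphyaddr` stands for the physical-address width M and is assumed to lie between 13 and 52.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_PWT        (1ULL << 3)   /* page-level write-through */
#define CR3_PCD        (1ULL << 4)   /* page-level cache disable */
#define CR3_PCID_MASK  0xfffULL      /* bits 11:0 when CR4.PCIDE = 1 */

/* Hypothetical decoded view of CR3. */
struct cr3_fields {
    uint64_t table_phys;  /* physical address of the PML4/PML5 table */
    unsigned pcid;        /* meaningful only when CR4.PCIDE = 1 */
    bool pwt, pcd;        /* meaningful only when CR4.PCIDE = 0 */
};

static struct cr3_fields decode_cr3(uint64_t cr3, bool pcide,
                                    unsigned maxphyaddr)
{
    struct cr3_fields f = {0};
    uint64_t addr_mask;

    assert(maxphyaddr > 12 && maxphyaddr <= 52);

    /* Bits M-1:12 hold the 4 KB-aligned table address. */
    addr_mask = ((1ULL << maxphyaddr) - 1) & ~0xfffULL;
    f.table_phys = cr3 & addr_mask;

    if (pcide) {
        f.pcid = (unsigned)(cr3 & CR3_PCID_MASK);
    } else {
        f.pwt = (cr3 & CR3_PWT) != 0;
        f.pcd = (cr3 & CR3_PCD) != 0;
    }
    return f;
}
```

The low 12 bits are interpreted differently depending on CR4.PCIDE, which is why the sketch takes that bit as an explicit parameter rather than guessing from the value.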

CR3 format when CR4.PCIDE = 0:

| Bit Position(s) | Contents | Description |
|---|---|---|
| 2:0 | Ignored | - |
| 3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the PML4 table during linear-address translation (see Section 4.9.2) | Page-level write-through attribute |
| 4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the PML4 table during linear-address translation (see Section 4.9.2) | Page-level cache-disable attribute |
| 11:5 | Ignored | - |
| M–1:12 | Physical address of the 4-KByte aligned PML4 table or PML5 table used for linear-address translation | Physical address of the 4 KB-aligned PML4/PML5 table |
| 63:M | Reserved (must be 0) | M is the physical-address width, at most 52 bits |

CR3 format when CR4.PCIDE = 1:

| Bit Position(s) | Contents | Description |
|---|---|---|
| 11:0 | PCID (see Section 4.10.1) | - |
| M–1:12 | Physical address of the 4-KByte aligned PML4 table or PML5 table used for linear-address translation | Physical address of the 4 KB-aligned PML4/PML5 table |
| 63:M | Reserved (must be 0) | M is the physical-address width, at most 52 bits |

1.2.3 PML5/PGD entry format

The x86 PML5 corresponds to the Linux PGD (page global directory):

| Bit Position(s) | Contents | Description |
|---|---|---|
| 0 (P) | Present; must be 1 to reference a PML4 table | Must be 1 to point to a PML4 table |
| 1 (R/W) | Read/write; if 0, writes may not be allowed to the 256-TByte region controlled by this entry (see Section 4.6) | Write permission for the 256 TB region |
| 2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 256-TByte region controlled by this entry (see Section 4.6) | User-mode access permission for the 256 TB region |
| 3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the PML4 table referenced by this entry (see Section 4.9.2) | Page-level write-through attribute |
| 4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the PML4 table referenced by this entry (see Section 4.9.2) | Page-level cache-disable attribute |
| 5 (A) | Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8) | Whether this entry has been used for address translation |
| 6 | Ignored | - |
| 7 (PS) | Reserved (must be 0) | - |
| 11:8 | Ignored | - |
| M–1:12 | Physical address of 4-KByte aligned PML4 table referenced by this entry | Physical address of the 4 KB-aligned PML4 table |
| 51:M | Reserved (must be 0) | - |
| 62:52 | Ignored | - |
| 63 (XD) | If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 256-TByte region controlled by this entry; see Section 4.6); otherwise, reserved (must be 0) | If IA32_EFER.NXE = 1, disables execution from the 256 TB region |

1.2.4 PML4/P4D entry format

With 5-level paging, the x86 PML4 corresponds to the Linux P4D (page level-4 directory); with 4-level paging the P4D level is folded away and the PML4 is the PGD itself:

| Bit Position(s) | Contents | Description |
|---|---|---|
| 0 (P) | Present; must be 1 to reference a page-directory-pointer table | Must be 1 to point to a PDPT |
| 1 (R/W) | Read/write; if 0, writes may not be allowed to the 512-GByte region controlled by this entry (see Section 4.6) | Write permission for the 512 GB region |
| 2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 512-GByte region controlled by this entry (see Section 4.6) | User-mode access permission for the 512 GB region |
| 3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the page-directory-pointer table referenced by this entry (see Section 4.9.2) | Page-level write-through attribute |
| 4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the page-directory-pointer table referenced by this entry (see Section 4.9.2) | Page-level cache-disable attribute |
| 5 (A) | Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8) | Whether this entry has been used for address translation |
| 6 | Ignored | - |
| 7 (PS) | Reserved (must be 0) | - |
| 11:8 | Ignored | - |
| M–1:12 | Physical address of 4-KByte aligned page-directory-pointer table referenced by this entry | Physical address of the 4 KB-aligned PDPT |
| 51:M | Reserved (must be 0) | - |
| 62:52 | Ignored | - |
| 63 (XD) | If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 512-GByte region controlled by this entry; see Section 4.6); otherwise, reserved (must be 0) | If IA32_EFER.NXE = 1, disables execution from the 512 GB region |

1.2.5 PDPT/PUD entry format (1-GByte Page)

The x86 PDPT (Page-Directory-Pointer Table) corresponds to the Linux PUD (page upper directory):

| Bit Position(s) | Contents | Description |
|---|---|---|
| 0 (P) | Present; must be 1 to map a 1-GByte page | Must be 1 to map a 1 GB page |
| 1 (R/W) | Read/write; if 0, writes may not be allowed to the 1-GByte page referenced by this entry (see Section 4.6) | Write permission for the 1 GB page |
| 2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 1-GByte page referenced by this entry (see Section 4.6) | User-mode access permission for the 1 GB page |
| 3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the 1-GByte page referenced by this entry (see Section 4.9.2) | Page-level write-through attribute |
| 4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the 1-GByte page referenced by this entry (see Section 4.9.2) | Page-level cache-disable attribute |
| 5 (A) | Accessed; indicates whether software has accessed the 1-GByte page referenced by this entry (see Section 4.8) | Whether software has accessed the 1 GB page mapped by this entry |
| 6 (D) | Dirty; indicates whether software has written to the 1-GByte page referenced by this entry (see Section 4.8) | Whether software has written to the 1 GB page mapped by this entry |
| 7 (PS) | Page size; must be 1 (otherwise, this entry references a page directory; see Table 4-17) | Must be 1 |
| 8 (G) | Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise | If CR4.PGE = 1, whether the translation is global |
| 11:9 | Ignored | - |
| 12 (PAT) | Indirectly determines the memory type used to access the 1-GByte page referenced by this entry (see Section 4.9.2) | Indirectly determines the memory type of the 1 GB page |
| 29:13 | Reserved (must be 0) | - |
| (M–1):30 | Physical address of the 1-GByte page referenced by this entry | Physical address of the 1 GB-aligned page |
| 51:M | Reserved (must be 0) | - |
| 58:52 | Ignored | - |
| 62:59 | Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 4.6.2); otherwise, it is not used to control access rights. | Protection key; controls the page's access rights if CR4.PKE = 1 or CR4.PKS = 1 |
| 63 (XD) | If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 1-GByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0) | If IA32_EFER.NXE = 1, disables execution from the 1 GB page |

1.2.6 PDPT/PUD entry format (Page Directory)

The x86 PDPT (Page-Directory-Pointer Table) corresponds to the Linux PUD (page upper directory):

| Bit Position(s) | Contents | Description |
|---|---|---|
| 0 (P) | Present; must be 1 to reference a page directory | Must be 1 to point to a PD |
| 1 (R/W) | Read/write; if 0, writes may not be allowed to the 1-GByte region controlled by this entry (see Section 4.6) | Write permission for the 1 GB region |
| 2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 1-GByte region controlled by this entry (see Section 4.6) | User-mode access permission for the 1 GB region |
| 3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the page directory referenced by this entry (see Section 4.9.2) | Page-level write-through attribute |
| 4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the page directory referenced by this entry (see Section 4.9.2) | Page-level cache-disable attribute |
| 5 (A) | Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8) | Whether this entry has been used for address translation |
| 6 | Ignored | - |
| 7 (PS) | Page size; must be 0 (otherwise, this entry maps a 1-GByte page; see Table 4-16) | Must be 0 |
| 11:8 | Ignored | - |
| (M–1):12 | Physical address of 4-KByte aligned page directory referenced by this entry | Physical address of the 4 KB-aligned PD |
| 51:M | Reserved (must be 0) | - |
| 62:52 | Ignored | - |
| 63 (XD) | If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 1-GByte region controlled by this entry; see Section 4.6); otherwise, reserved (must be 0) | If IA32_EFER.NXE = 1, disables execution from the 1 GB region |

1.2.7 PD/PMD entry format (2-MByte Page)

The x86 PD (Page Directory) corresponds to the Linux PMD (page middle directory):

| Bit Position(s) | Contents | Description |
|---|---|---|
| 0 (P) | Present; must be 1 to map a 2-MByte page | Must be 1 to map a 2 MB page |
| 1 (R/W) | Read/write; if 0, writes may not be allowed to the 2-MByte page referenced by this entry (see Section 4.6) | Write permission for the 2 MB page |
| 2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte page referenced by this entry (see Section 4.6) | User-mode access permission for the 2 MB page |
| 3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9.2) | Page-level write-through attribute |
| 4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9.2) | Page-level cache-disable attribute |
| 5 (A) | Accessed; indicates whether software has accessed the 2-MByte page referenced by this entry (see Section 4.8) | Whether software has accessed the 2 MB page mapped by this entry |
| 6 (D) | Dirty; indicates whether software has written to the 2-MByte page referenced by this entry (see Section 4.8) | Whether software has written to the 2 MB page mapped by this entry |
| 7 (PS) | Page size; must be 1 (otherwise, this entry references a page table; see Table 4-19) | Must be 1 |
| 8 (G) | Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise | If CR4.PGE = 1, whether the translation is global |
| 11:9 | Ignored | - |
| 12 (PAT) | Indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section 4.9.2) | Indirectly determines the memory type of the 2 MB page |
| 20:13 | Reserved (must be 0) | - |
| (M–1):21 | Physical address of the 2-MByte page referenced by this entry | Physical address of the 2 MB-aligned page |
| 51:M | Reserved (must be 0) | - |
| 58:52 | Ignored | - |
| 62:59 | Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 4.6.2); otherwise, it is not used to control access rights. | Protection key; controls the page's access rights if CR4.PKE = 1 or CR4.PKS = 1 |
| 63 (XD) | If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0) | If IA32_EFER.NXE = 1, disables execution from the 2 MB page |

1.2.8 PD/PMD entry format (Page Directory)

The x86 PD (Page Directory) corresponds to the Linux PMD (page middle directory):

| Bit Position(s) | Contents | Description |
|---|---|---|
| 0 (P) | Present; must be 1 to reference a page table | Must be 1 to point to a PT |
| 1 (R/W) | Read/write; if 0, writes may not be allowed to the 2-MByte region controlled by this entry (see Section 4.6) | Write permission for the 2 MB region |
| 2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte region controlled by this entry (see Section 4.6) | User-mode access permission for the 2 MB region |
| 3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the page table referenced by this entry (see Section 4.9.2) | Page-level write-through attribute |
| 4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this entry (see Section 4.9.2) | Page-level cache-disable attribute |
| 5 (A) | Accessed; indicates whether this entry has been used for linear-address translation (see Section 4.8) | Whether this entry has been used for address translation |
| 6 | Ignored | - |
| 7 (PS) | Page size; must be 0 (otherwise, this entry maps a 2-MByte page; see Table 4-18) | Must be 0 |
| 11:8 | Ignored | - |
| (M–1):12 | Physical address of 4-KByte aligned page table referenced by this entry | Physical address of the 4 KB-aligned PT |
| 51:M | Reserved (must be 0) | - |
| 62:52 | Ignored | - |
| 63 (XD) | If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte region controlled by this entry; see Section 4.6); otherwise, reserved (must be 0) | If IA32_EFER.NXE = 1, disables execution from the 2 MB region |

1.2.9 PTE format (4-KByte Page)

Both x86 and Linux call this the PTE (Page-Table Entry):

| Bit Position(s) | Contents | Description |
|---|---|---|
| 0 (P) | Present; must be 1 to map a 4-KByte page | Must be 1 to map a 4 KB page |
| 1 (R/W) | Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 4.6) | Write permission for the 4 KB page |
| 2 (U/S) | User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section 4.6) | User-mode access permission for the 4 KB page |
| 3 (PWT) | Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2) | Page-level write-through attribute |
| 4 (PCD) | Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2) | Page-level cache-disable attribute |
| 5 (A) | Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 4.8) | Whether software has accessed the 4 KB page mapped by this entry |
| 6 (D) | Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 4.8) | Whether software has written to the 4 KB page mapped by this entry |
| 7 (PAT) | Indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2) | Indirectly determines the memory type of the 4 KB page |
| 8 (G) | Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise | If CR4.PGE = 1, whether the translation is global |
| 11:9 | Ignored | - |
| (M–1):12 | Physical address of the 4-KByte page referenced by this entry | Physical address of the 4 KB-aligned page |
| 51:M | Reserved (must be 0) | - |
| 58:52 | Ignored | - |
| 62:59 | Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 4.6.2); otherwise, it is not used to control access rights. | Protection key; controls the page's access rights if CR4.PKE = 1 or CR4.PKS = 1 |
| 63 (XD) | If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0) | If IA32_EFER.NXE = 1, disables execution from the 4 KB page |

1.3 Access Right

1.3.1 Access modes

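1. Access modes

Every access to a linear address is either a supervisor-mode access or a user-mode access. For all instruction fetches and most data accesses, the distinction is determined by the current privilege level (CPL):

- Accesses made while CPL < 3 are supervisor-mode accesses.
- Accesses made while CPL = 3 are user-mode accesses.

The lowest 2 bits of a segment selector are called the RPL. The selectors in CS and SS are special: their RPL represents the privilege level of the currently running code, so for these two registers the RPL is also called the CPL.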
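The rule above boils down to two tiny helpers, sketched here with hypothetical names. The selector values mentioned afterwards (0x33 and 0x10) are the user and kernel code-segment selectors commonly seen on x86-64 Linux and are used purely as sample inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the RPL is the low 2 bits of a segment selector. */
static unsigned selector_rpl(uint16_t selector)
{
    return selector & 0x3;
}

/* Hypothetical helper: accesses made with CPL < 3 are supervisor-mode
 * accesses, accesses made with CPL = 3 are user-mode accesses. */
static bool is_supervisor_access(unsigned cpl)
{
    assert(cpl <= 3);
    return cpl < 3;
}
```

With CS = 0x33 the RPL (and therefore the CPL) is 3, so accesses are user-mode; with CS = 0x10 it is 0, so accesses are supervisor-mode.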

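Implicit accesses: some operations implicitly access system data structures through linear addresses; the resulting accesses to those data structures are supervisor-mode accesses regardless of CPL. Examples include accesses to the global descriptor table (GDT) or local descriptor table (LDT) to load a segment descriptor, accesses to the interrupt descriptor table (IDT) when delivering an interrupt or exception, and accesses to the task-state segment (TSS) as part of a task switch or a change of CPL. All such accesses are called implicit supervisor-mode accesses regardless of CPL.

CPL … vm_flags is converted into the paging-structure format described in the manual of the specific CPU architecture: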

mprotect() → do_mprotect_pkey() → mprotect_fixup() → vma_set_page_prot() → vm_pgprot_modify() → vm_get_page_prot():

    pgprot_t vm_get_page_prot(unsigned long vm_flags)
    {
        /* (1) Look up protection_map[] to convert vm_flags into a pgprot */
        pgprot_t ret = __pgprot(pgprot_val(protection_map[vm_flags &
                    (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
                /* (2) Convert the vma's protection key into the architecture's protection key */
                pgprot_val(arch_vm_get_page_prot(vm_flags)));

        return arch_filter_pgprot(ret);
    }

The definition of protection_map[]:

    /* description of effects of mapping type and prot in current implementation.
     * this is due to the limited x86 page protection hardware.  The expected
     * behavior is in parens:
     *
     * map_type     prot
     *              PROT_NONE       PROT_READ       PROT_WRITE      PROT_EXEC
     * MAP_SHARED   r: (no) no      r: (yes) yes    r: (no) yes     r: (no) yes
     *              w: (no) no      w: (no) no      w: (yes) yes    w: (no) no
     *              x: (no) no      x: (no) yes     x: (no) yes     x: (yes) yes
     *
     * MAP_PRIVATE  r: (no) no      r: (yes) yes    r: (no) yes     r: (no) yes
     *              w: (no) no      w: (no) no      w: (copy) copy  w: (no) no
     *              x: (no) no      x: (no) yes     x: (no) yes     x: (yes) yes
     */
    pgprot_t protection_map[16] __ro_after_init = {
        __P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
        __S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
    };

    #define VM_READ     0x00000001  /* currently active flags */
    #define VM_WRITE    0x00000002
    #define VM_EXEC     0x00000004
    #define VM_SHARED   0x00000008

The index into protection_map[] is simply the combination of the VM_* flags, for example:

    __P000: index = 0000
    __P001: index = 0001    // VM_READ
    __P111: index = 0111    // VM_READ|VM_WRITE|VM_EXEC
    __S000: index = 1000    // VM_SHARED
    __S111: index = 1111    // VM_SHARED|VM_READ|VM_WRITE|VM_EXEC
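The lookup can be tried out in user space with the sketch below. It mirrors the table layout with strings instead of `pgprot_t` values; `protection_names[]` and `protection_name()` are hypothetical stand-ins, and the `VM_*` values are the ones quoted above.

```c
#include <assert.h>
#include <string.h>

#define VM_READ    0x1UL
#define VM_WRITE   0x2UL
#define VM_EXEC    0x4UL
#define VM_SHARED  0x8UL

/* Same ordering as protection_map[], but holding the macro names so the
 * result of a lookup can be inspected. */
static const char *const protection_names[16] = {
    "__P000", "__P001", "__P010", "__P011",
    "__P100", "__P101", "__P110", "__P111",
    "__S000", "__S001", "__S010", "__S011",
    "__S100", "__S101", "__S110", "__S111",
};

/* Hypothetical helper: mask off everything except the four protection
 * bits and use the result as the table index. */
static const char *protection_name(unsigned long vm_flags)
{
    return protection_names[vm_flags &
                            (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)];
}
```

Because the other bits of `vm_flags` are masked off, unrelated flags never influence which entry is selected.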

The definitions of __P000 and friends differ between CPU architectures; they are the concrete paging-structure formats. Taking x86_64 as an example:

    #define __P000  PAGE_NONE
    #define __P001  PAGE_READONLY
    #define __P010  PAGE_COPY
    #define __P011  PAGE_COPY
    #define __P100  PAGE_READONLY_EXEC
    #define __P101  PAGE_READONLY_EXEC
    #define __P110  PAGE_COPY_EXEC
    #define __P111  PAGE_COPY_EXEC

    #define __S000  PAGE_NONE
    #define __S001  PAGE_READONLY
    #define __S010  PAGE_SHARED
    #define __S011  PAGE_SHARED
    #define __S100  PAGE_READONLY_EXEC
    #define __S101  PAGE_READONLY_EXEC
    #define __S110  PAGE_SHARED_EXEC
    #define __S111  PAGE_SHARED_EXEC

↓

    #define PAGE_NONE           __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
    #define PAGE_SHARED         __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                                         _PAGE_ACCESSED | _PAGE_NX)

    #define PAGE_SHARED_EXEC    __pgprot(_PAGE_PRESENT | _PAGE_RW | \
                                         _PAGE_USER | _PAGE_ACCESSED)
    #define PAGE_COPY_NOEXEC    __pgprot(_PAGE_PRESENT | _PAGE_USER | \
                                         _PAGE_ACCESSED | _PAGE_NX)
    #define PAGE_COPY_EXEC      __pgprot(_PAGE_PRESENT | _PAGE_USER | \
                                         _PAGE_ACCESSED)
    #define PAGE_COPY           PAGE_COPY_NOEXEC
    #define PAGE_READONLY       __pgprot(_PAGE_PRESENT | _PAGE_USER | \
                                         _PAGE_ACCESSED | _PAGE_NX)
    #define PAGE_READONLY_EXEC  __pgprot(_PAGE_PRESENT | _PAGE_USER | \
                                         _PAGE_ACCESSED)

↓

The bit definitions correspond one-to-one to those in the x86 manual:

    #define _PAGE_BIT_PRESENT   0   /* is present */
    #define _PAGE_BIT_RW        1   /* writeable */
    #define _PAGE_BIT_USER      2   /* userspace addressable */
    #define _PAGE_BIT_PWT       3   /* page write through */
    #define _PAGE_BIT_PCD       4   /* page cache disabled */
    #define _PAGE_BIT_ACCESSED  5   /* was accessed (raised by CPU) */
    #define _PAGE_BIT_DIRTY     6   /* was written to (raised by CPU) */
    #define _PAGE_BIT_PSE       7   /* 4 MB (or 2MB) page */
    #define _PAGE_BIT_PAT       7   /* on 4KB pages */
    #define _PAGE_BIT_GLOBAL    8   /* Global TLB entry PPro+ */
    #define _PAGE_BIT_SOFTW1    9   /* available for programmer */
    #define _PAGE_BIT_SOFTW2    10  /* " */
    #define _PAGE_BIT_SOFTW3    11  /* " */
    #define _PAGE_BIT_PAT_LARGE 12  /* On 2MB or 1GB pages */
    #define _PAGE_BIT_SOFTW4    58  /* available for programmer */
    #define _PAGE_BIT_PKEY_BIT0 59  /* Protection Keys, bit 1/4 */
    #define _PAGE_BIT_PKEY_BIT1 60  /* Protection Keys, bit 2/4 */
    #define _PAGE_BIT_PKEY_BIT2 61  /* Protection Keys, bit 3/4 */
    #define _PAGE_BIT_PKEY_BIT3 62  /* Protection Keys, bit 4/4 */
    #define _PAGE_BIT_NX        63  /* No execute: only valid after cpuid check */
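To see what these definitions look like as raw entry values, the following user-space sketch rebuilds two of the protection combinations and decodes them again. All `PTE_*` names, the `SK_` constants and the helper functions are hypothetical stand-ins (the kernel's own identifiers are not usable outside the kernel); only the bit positions come from the listing above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for _PAGE_PRESENT, _PAGE_RW, ... built from the bit numbers. */
#define PTE_PRESENT   (1ULL << 0)
#define PTE_RW        (1ULL << 1)
#define PTE_USER      (1ULL << 2)
#define PTE_ACCESSED  (1ULL << 5)
#define PTE_NX        (1ULL << 63)
#define PTE_PKEY_SHIFT 59
#define PTE_PKEY_MASK  0xfULL

/* Stand-ins for PAGE_READONLY and PAGE_SHARED_EXEC. */
static const uint64_t SK_PAGE_READONLY =
    PTE_PRESENT | PTE_USER | PTE_ACCESSED | PTE_NX;
static const uint64_t SK_PAGE_SHARED_EXEC =
    PTE_PRESENT | PTE_RW | PTE_USER | PTE_ACCESSED;

static bool pte_is_writable(uint64_t pte)   { return (pte & PTE_RW) != 0; }
static bool pte_is_executable(uint64_t pte) { return (pte & PTE_NX) == 0; }
static bool pte_is_user(uint64_t pte)       { return (pte & PTE_USER) != 0; }

/* The 4-bit protection key lives in bits 62:59. */
static unsigned pte_pkey(uint64_t pte)
{
    return (unsigned)((pte >> PTE_PKEY_SHIFT) & PTE_PKEY_MASK);
}
```

Note that execute permission is expressed negatively: a mapping is executable when the NX bit is clear, which is why the `*_EXEC` variants are the ones without `_PAGE_NX`.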

arch_vm_get_page_prot() performs the format conversion for the protection key:

    #define arch_vm_get_page_prot(vm_flags) __pgprot(   \
            ((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) | \
            ((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) | \
            ((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) | \
            ((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))

↓

    #define _PAGE_PKEY_BIT0 (_AT(pteval_t, 1)

[…]

        vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
                (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
            error = prot_none_walk(vma, start, end, newflags);
            if (error)
                return error;
        }

        /*
         * If we make a private mapping writable we increase our commit;
         * but (without finer accounting) cannot reduce our commit if we
         * make it unwritable again. hugetlb mapping were accounted for
         * even if read-only so there is no need to account for them here
         */
        /* (3.5.2) Handling for the write attribute */
        if (newflags & VM_WRITE) {
            /* Check space limits when area turns into data. */
            if (!may_expand_vm(mm, newflags, nrpages) &&
                    may_expand_vm(mm, oldflags, nrpages))
                return -ENOMEM;
            if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
                            VM_SHARED|VM_NORESERVE))) {
                charged = nrpages;
                if (security_vm_enough_memory_mm(mm, charged))
                    return -ENOMEM;
                newflags |= VM_ACCOUNT;
            }
        }

        /*
         * First try to merge with previous and/or next vma.
         */
        /* (3.5.3) If the attributes match, try to merge the vmas */
        pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
        *pprev = vma_merge(mm, *pprev, start, end, newflags,
                   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
                   vma->vm_userfaultfd_ctx);
        if (*pprev) {
            vma = *pprev;
            VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
            goto success;
        }

        *pprev = vma;

        /* (3.5.4) If the attributes differ, the vma has to be split */
        if (start != vma->vm_start) {
            error = split_vma(mm, vma, start, 1);
            if (error)
                goto fail;
        }

        if (end != vma->vm_end) {
            error = split_vma(mm, vma, end, 0);
            if (error)
                goto fail;
        }

    success:
        /*
         * vm_flags and vm_page_prot are protected by the mmap_sem
         * held in write mode.
         */
        /* (3.5.5) Set the new vma->vm_flags */
        vma->vm_flags = newflags;
        /* (3.5.6) Work out whether write notify is needed */
        dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
        /* (3.5.7) Compute vma->vm_page_prot from the new vma->vm_flags */
        vma_set_page_prot(vma);

        /* (3.5.8) Change the page attributes of the vma */
        change_protection(vma, start, end, vma->vm_page_prot,
                  dirty_accountable, 0);

        /*
         * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
         * fault on access.
         */
        /* (3.5.9) Pre-fault the range so that the COW happens now */
        if ((oldflags & (VM_WRITE | VM_SHARED | VM_LOCKED)) == VM_LOCKED &&
                (newflags & VM_WRITE)) {
            populate_vma_page_range(vma, start, end, NULL);
        }

        vm_stat_account(mm, oldflags, -nrpages);
        vm_stat_account(mm, newflags, nrpages);
        perf_event_mmap(vma);
        return 0;

    fail:
        vm_unacct_memory(charged);
        return error;
    }

|→

    /*
     * Some shared mappings will want the pages marked read-only
     * to track write events. If so, we'll downgrade vm_page_prot
     * to the private version (using protection_map[] without the
     * VM_SHARED bit).
     */
    int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
    {
        vm_flags_t vm_flags = vma->vm_flags;
        const struct vm_operations_struct *vm_ops = vma->vm_ops;

        /* If it was private or non-writable, the write bit is already clear */
        /* (3.5.6.1) Must be a writable shared mapping */
        if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
            return 0;

        /* The backer wishes to know when pages are first written to? */
        /* (3.5.6.2) The mapping provides a callback that makes pages writable again */
        if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
            return 1;

        /* The open routine did something to the protections that pgprot_modify
         * won't preserve? */
        /* (3.5.6.3) The prot must not have been changed */
        if (pgprot_val(vm_page_prot) !=
            pgprot_val(vm_pgprot_modify(vm_page_prot, vm_flags)))
            return 0;

        /* Do we need to track softdirty? */
        /* (3.5.6.4) */
        if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
            return 1;

        /* Specialty mapping? */
        /* (3.5.6.5) */
        if (vm_flags & VM_PFNMAP)
            return 0;

        /* Can the mapping track the dirty pages? */
        /* (3.5.6.6) */
        return vma->vm_file && vma->vm_file->f_mapping &&
            mapping_cap_account_dirty(vma->vm_file->f_mapping);
    }

|→

    /* Update vma->vm_page_prot to reflect vma->vm_flags. */
    void vma_set_page_prot(struct vm_area_struct *vma)
    {
        unsigned long vm_flags = vma->vm_flags;
        pgprot_t vm_page_prot;

        /* (3.5.7.1) Compute vma->vm_page_prot from vma->vm_flags */
        vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
        /* (3.5.7.2) If writenotify is needed, downgrade vm_page_prot to the private version */
        if (vma_wants_writenotify(vma, vm_page_prot)) {
            vm_flags &= ~VM_SHARED;
            vm_page_prot = vm_pgprot_modify(vm_page_prot, vm_flags);
        }
        /* remove_protection_ptes reads vma->vm_page_prot without mmap_sem */
        /* (3.5.7.3) Update vma->vm_page_prot */
        WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
    }

|→ change_protection() → change_protection_range() → change_p4d_range() → change_pud_range() → change_pmd_range() → change_pte_range() → pte_modify()

2.4 writenotify

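The problem arises as follows.

mmap() can map the contents of a file directly into a process's address space. There are two relevant mapping types:

- MAP_SHARED: data written to the mapped region is copied back to the file, and the mapping is shared with other processes that map the same file.
- MAP_PRIVATE: a write to the mapped region creates a copy of the mapped file data, i.e. a private copy-on-write mapping; no modification made to the region is written back to the original file.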
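The difference between the two mapping types can be observed from user space with a short experiment. The sketch below is only an illustration under a few assumptions: a POSIX system with a writable `/tmp`, and a hypothetical helper name `file_byte_after_mapped_write()`. It writes one byte through a mapping and then reads the file back with pread().

```c
#define _XOPEN_SOURCE 700   /* for mkstemp() and pread() */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create a one-byte temporary file containing 'a',
 * write 'X' through a mapping created with map_flag (MAP_SHARED or
 * MAP_PRIVATE), and return the byte the file holds afterwards.
 * Returns -1 on any error. */
static int file_byte_after_mapped_write(int map_flag)
{
    char path[] = "/tmp/mmap-demo-XXXXXX";
    char byte = 'a';
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the file vanishes once fd is closed */

    if (write(fd, &byte, 1) == 1) {
        char *map = mmap(NULL, 1, PROT_READ | PROT_WRITE, map_flag, fd, 0);
        if (map != MAP_FAILED) {
            map[0] = 'X';               /* store through the mapping */
            msync(map, 1, MS_SYNC);     /* flush shared changes to the file */
            munmap(map, 1);
            if (pread(fd, &byte, 1, 0) == 1)
                result = byte;
        }
    }
    close(fd);
    return result;
}
```

Called with `MAP_SHARED` the helper is expected to return `'X'`, because the store reaches the file; called with `MAP_PRIVATE` it returns `'a'`, because the store only touched the process's private copy.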

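For a MAP_SHARED mapping, modifying the mapped memory makes the PTE dirty (DirtyPTE). The file-cache writeback machinery has to learn about this, that is, the corresponding page structure must be marked dirty (DirtyPage) so that the changes are flushed back to the file at a suitable time.

The question is therefore how a change of the Dirty flag (D) in the PTE gets propagated to the dirty state of the page. The kernel uses the following trick to implement this: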

1. If the mapping is shared and writable (VM_WRITE|VM_SHARED), its pages are downgraded to private read-only pages:

mmap() → do_mmap() → mmap_region() → vma_set_page_prot()

    void vma_set_page_prot(struct vm_area_struct *vma)
    {
        unsigned long vm_flags = vma->vm_flags;
        pgprot_t vm_page_prot;

        vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
        /* (1) A writable shared mapping needs writenotify, so it is downgraded */
        if (vma_wants_writenotify(vma, vm_page_prot)) {
            vm_flags &= ~VM_SHARED;
            vm_page_prot = vm_pgprot_modify(vm_page_prot, vm_flags);
        }
        /* remove_protection_ptes reads vma->vm_page_prot without mmap_sem */
        WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
    }

2. A write to a downgraded page triggers a fault. The fault handler makes the page writable and marks the page dirty:

__handle_mm_fault() → handle_pte_fault() → do_shared_fault() → do_page_mkwrite() → vmf->vma->vm_ops->page_mkwrite():

    int filemap_page_mkwrite(struct vm_fault *vmf)
    {
        struct page *page = vmf->page;
        struct inode *inode = file_inode(vmf->vma->vm_file);
        int ret = VM_FAULT_LOCKED;

        sb_start_pagefault(inode->i_sb);
        vma_file_update_time(vmf->vma);
        lock_page(page);
        if (page->mapping != inode->i_mapping) {
            unlock_page(page);
            ret = VM_FAULT_NOPAGE;
            goto out;
        }
        /*
         * We mark the page dirty already here so that when freeze is in
         * progress, we are guaranteed that writeback during freezing will
         * see the dirty page and writeprotect it again.
         */
        /* Mark the page dirty */
        set_page_dirty(page);
        wait_for_stable_page(page);
    out:
        sb_end_pagefault(inode->i_sb);
        return ret;
    }

__handle_mm_fault() → handle_pte_fault() → do_shared_fault() → finish_fault() → alloc_set_pte() → maybe_mkwrite():

    static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
    {
        /* Make the page writable again */
        if (likely(vma->vm_flags & VM_WRITE))
            pte = pte_mkwrite(pte);
        return pte;
    }

3. After the kernel has written the dirty page back, the page is downgraded to read-only again:

do_writepages() → ext4_writepages() → mpage_prepare_extent_to_map() → mpage_process_page_bufs() → mpage_submit_page() → clear_page_dirty_for_io() → page_mkclean() → page_mkclean_file() → page_mkclean_one() → pte_wrprotect()

    static inline pte_t pte_wrprotect(pte_t pte)
    {
        return pte_clear_flags(pte, _PAGE_RW);
    }

2.5 Switching mm

On a process switch the page-table mapping must also switch to the new address space: the mm->pgd of the incoming process has to be loaded into the CR3 register.

schedule() → __schedule() → context_switch() → switch_mm_irqs_off():

    void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
                struct task_struct *tsk)
    {
        load_new_mm_cr3(next->pgd, new_asid, true);
    }

References:

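1. Intel® 64 and IA-32 Architectures Software Developer's Manual
2. How Linux detects file modifications made through mmap
3. The Linux file system
4. An overview of the Linux paging mechanism
5. Paged memory management
6. The paging mechanism in Linux
7. A question about mmap and msync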


